Specifications and features for new FRONT for PARRY.
Some are already done in the old FRONT.
***** CHOICE OF LANGUAGE **********
Is there any reason that the following can't be done in SAIL as easily as MLISP?
BILL: Historical reasons:
Lisp's arbitrary data structures
Especially good for retaining record of transformations performed.
Lisp's interpreter for debugging
***** PHILOSOPHY **********
Treat SYNONM, IRREG, IDIOM, SPATS, & CPATS in uniform manner.
No more clear distinction between levels of processing
(input, re-spelled, canonized, pattern, λ####).
Recently, I've been returning to some separate levels. There is still no
distinction between idioms and simple patterns.
Combine pattern matching and parsing.
Parse noun phrases.
Pattern match clauses, with patterns containing specific verbs.
Pattern match compound sentences ?
The following description sounds like a production system.
***** MATCHING ALGORITHM **********
Let's see what Terry W. has been thinking about along these lines?
Single look-up program finding the longest match. Try to start the match at the
front of the input, but sometimes skip over stuff to work on the middle. Skip when
the longest match in the table still has more stuff in it but the input differs.
Use remainder of pattern to choose from multiple meanings for following
word in input.
e.g. how's she doing
HOW IS she doing
how BE she doing
how be THEY doing
how be NURSES doing
how be NURSE doing
how be nurse DO
λ1234
There were problems with the method above. A lot of redundant work was done.
A word would be looked up repeatedly before it was finally replaced by a more
general word. This is why some of the old idea of levels was brought back.
Now the processing looks like:
input:
How's she doing?
words found:
HOW IS SHE DO progressive ?
markers removed: what=HOW, tense=progressive, mode=?
BE SHE DO
pronouns resolved: SHE → LINDA
BE LINDA DO
auxiliary verb removed:
LINDA DO
sentence structure found:
λsubject-verb
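A rough sketch of the staged reduction above, in Python for illustration only
(the dictionary, marker, and pronoun tables here are made up, not the real
FRONT data):

# Hypothetical staged reduction mirroring the levels shown above.
DICTIONARY = {"HOW'S": ["HOW", "IS"], "DOING": ["DO", "progressive"]}
CANONICAL  = {"IS": "BE", "AM": "BE", "ARE": "BE"}
MARKERS    = {"HOW": ("what", "HOW"), "progressive": ("tense", "progressive"),
              "?": ("mode", "?")}
PRONOUNS   = {"SHE": "LINDA"}
AUXILIARIES = {"BE"}

def reduce_input(sentence):
    # words found: split, look up, expand contractions and inflections
    tokens = sentence.upper().replace("?", " ?").split()
    words = []
    for t in tokens:
        words += DICTIONARY.get(t, [t])
    words = [CANONICAL.get(w, w) for w in words]
    # markers removed: strip them out, remembering them as flags
    flags, kept = {}, []
    for w in words:
        if w in MARKERS:
            key, val = MARKERS[w]
            flags[key] = val
        else:
            kept.append(w)
    # pronouns resolved
    kept = [PRONOUNS.get(w, w) for w in kept]
    # auxiliary verb removed
    kept = [w for w in kept if w not in AUXILIARIES]
    return kept, flags

print(reduce_input("How's she doing?"))
# (['LINDA', 'DO'], {'what': 'HOW', 'tense': 'progressive', 'mode': '?'})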
Look-up algorithm must provide for word dropping.
A few meanings constantly change (e.g. "they", "bug"?)
Must call function to compute meaning.
"Window" internal workings.
Accept multiple sentences at once.
Have all intermediate steps available until after response given.
It's not as easy as it sounds because some rules pick up scattered
parts of the sentence without involving intervening words.
Keep a list of all transformations performed.
Allow additional information to be learned during interview.
Might be hard to insert in permanent data storage, but could be stuck on.
***** SUFFIXES & SPELLING **********
Sources of dictionary entries:
Present PARRY words: [PAR,RCP]
SYNONM.ALF
IRREG.ALF
SUFFIX.ALF
Present PARRY phrases: [PAR,RCP]
IDIOM.ALF
SPATS.SEL
CPATS.SEL
SOR[ENG,KMC]
SORG[ENG,KMC]
New NF words: [PAR,RCP]
DICTIO
PREFIX
SUFFIX
New NF phrases: [PAR,RCP]
MULTI
Unused words: [ENG,KMC]
MSPEL2
WORDS
MAX
Unused phrases: [ENG,KMC]
SLANG
SYNS
SORN
SORNEW
SENTS
MEM.DOC[HMF,RCP]
Clean up characters on input.
Use algorithm like present FRONT:
1) Look up
2) Suffix removal
2.5) Prefix removal
3) Re-spelling
Make sure that the suffix is legal on that class of word, and determine what
effect it has on the class (i.e. part-of-speech → part-of-speech).
Gain information from prefixes & suffixes.
Replace suffix with part of speech, number, tense, etc.
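A rough sketch of the look-up and suffix-removal steps with the legality check,
in Python for illustration (the suffix and root tables are invented examples,
not SUFFIX.ALF or DICTIO):

# Hypothetical suffix removal with the class-legality check.
SUFFIXES = {"ING": ("VERB", "progressive"), "ED": ("VERB", "past"), "S": ("NOUN", "plural")}
ROOTS = {"DO": "VERB", "GO": "VERB", "NURSE": "NOUN"}

def find_root(word):
    # 1) straight look-up
    if word in ROOTS:
        return word, ROOTS[word], None
    # 2) suffix removal, checking the suffix is legal on that class of word
    for suffix, (attaches_to, feature) in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            for candidate in (stem, stem + "E"):     # undo a dropped final E
                if ROOTS.get(candidate) == attaches_to:
                    return candidate, attaches_to, feature
    # 3) re-spelling would be tried here before giving up
    return None

print(find_root("DOING"))    # ('DO', 'VERB', 'progressive')
print(find_root("NURSES"))   # ('NURSE', 'NOUN', 'plural')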
For spelling correction, see SPELL.REG[UP,DOC]
It does 1 extra, 1 missing, 1 wrong, 2 transposed.
I will do (see the sketch after this list):
1 extra
1 missing - add "E" at end (for suffixer)
1 wrong - neighboring keys
- "O" for "0"
- "Y" for "I" at end (for suffixer)
2 transposed
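A rough sketch of generating candidates for the corrections listed above, in
Python for illustration (the keyboard-neighbor table is only a made-up fragment):

# Hypothetical candidate generator for the four kinds of correction above.
NEIGHBORS = {"A": "QWSZ", "O": "0IPLK", "I": "UOJK"}

def spelling_candidates(word):
    out = set()
    # 1 extra letter: try deleting each position
    for i in range(len(word)):
        out.add(word[:i] + word[i+1:])
    # 1 missing letter: only add "E" at the end (leave the rest to the suffixer)
    out.add(word + "E")
    # 1 wrong letter: neighboring keys, "O" for "0", "Y" for "I" at the end
    for i, c in enumerate(word):
        for repl in NEIGHBORS.get(c, ""):
            out.add(word[:i] + repl + word[i+1:])
    out.add(word.replace("0", "O"))
    if word.endswith("I"):
        out.add(word[:-1] + "Y")
    # 2 transposed letters: swap each adjacent pair
    for i in range(len(word) - 1):
        out.add(word[:i] + word[i+1] + word[i] + word[i+2:])
    out.discard(word)
    return out

print("THE" in spelling_candidates("TEH"))   # True, via transposition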
Keep count of spelling corrections and failures.
Handle run-on words.
Try to identify misspelling after suffix removal. How?
Need more elaborate rules for normal English suffixing.
This involves classifying letters into about 10 nested groups
(e.g. liquid consonants).
***** PARTS OF SPEECH **********
Higher level classification of words:
NOUN for (DOES x GET ALONG WITH y)
VERB
PRONOUN for reflexives, at least
ADJECTIVE for (I BE x)
Have a table (matrix) giving a λ#### for specific values of x & y.
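A rough sketch of such a table, in Python for illustration (the class table and
λ numbers below are invented):

# Hypothetical lambda matrix for (DOES x GET ALONG WITH y).
CLASSES = {"PSYCHIATRIST": "DOCTOR", "SHRINK": "DOCTOR", "LINDA": "NURSE",
           "PATIENT": "PATIENT", "MAFIA": "MAFIA", "YOU": "YOU"}
GET_ALONG = {("DOCTOR", "PATIENT"): 4101,
             ("NURSE", "PATIENT"):  4102,
             ("YOU", "MAFIA"):      4103}

def get_along_lambda(x, y):
    # reduce the specific nouns to their classes, then index the matrix
    return GET_ALONG.get((CLASSES.get(x, "NOUN"), CLASSES.get(y, "NOUN")))

print(get_along_lambda("PSYCHIATRIST", "PATIENT"))   # 4101
print(get_along_lambda("YOU", "MAFIA"))              # 4103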
Retain verbs in infinitive form.
Change all nouns to "NOUN".
Change all adjectives to "ADJECTIVE"
Is it possible to distinguish NOUN-VERB-ADJECTIVE?
Not uniquely, but I could allow words in certain classes to fill
more than one type of slot in patterns.
***** PHRASES **********
For disambiguation (related to idioms) see Stone's papers.
Want to temporarily replace a verb with "verb", remove auxiliaries, then
bring back the specific verb. Using fixed levels of reduction has
eliminated this problem.
Have patterns containing specific verbs but with slots for arbitrary nouns.
Somewhere (memory ?) would know the correct interpretation of each
slot (i.e. SUBJECT, DIRECT OBJECT, etc.)
These are all done with patterns of the same form as idioms and simple sentence
patterns.
Verb phrase eliminator: to turn all of:
        I GO
        I WENT
        I DID GO
        I WAS GOING
        I HAVE GONE
        I HAVE BEEN GOING
into: I GO .
Also interrogatives:
        DID I GO
        HAVE I GONE
        WAS I GOING
        HAVE I BEEN GOING
into: I GO ?
Also passives:
        I AM TAKEN
        I AM BEING TAKEN
        I WAS TAKEN
        I HAVE BEEN TAKEN
        I WAS BEING TAKEN
into: x TAKE I .
Also passive interrogatives:
        AM I TAKEN
        AM I BEING TAKEN
        WAS I TAKEN
        HAVE I BEEN TAKEN
        WAS I BEING TAKEN
into: x TAKE I ?
The following verbs are dropped earlier
Modal verbs might also be removed:
WILL
CAN
SHOULD
OUGHT to
NEED to
WANT to
TRY to
What are:
SEEM to be
LIKE to
Have a pre-determined set of flags indicating what was removed.
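A rough sketch of the verb phrase eliminator and its flags, in Python for
illustration (the auxiliary list, infinitive table, and participle test are toy
stand-ins, not the real tables):

# Hypothetical verb phrase eliminator following the tables above.
AUXILIARIES = {"DO", "DOES", "DID", "AM", "IS", "ARE", "WAS", "WERE",
               "BE", "BEEN", "BEING", "HAVE", "HAS", "HAD",
               "WILL", "CAN", "SHOULD"}                      # includes some modals
INFINITIVE = {"WENT": "GO", "GONE": "GO", "GOING": "GO",
              "TOOK": "TAKE", "TAKEN": "TAKE", "TAKING": "TAKE"}
PARTICIPLES = {"TAKEN", "GIVEN", "SEEN"}                     # toy passive test

def eliminate_verb_phrase(words, question=False):
    flags = {"mode": "?" if question else "."}
    passive = any(w in PARTICIPLES for w in words)
    # drop auxiliaries and modals, restore the infinitive form of the verb
    kept = [INFINITIVE.get(w, w) for w in words if w not in AUXILIARIES]
    if passive:
        flags["voice"] = "passive"
        # x TAKE I : supply an unknown subject and demote the old subject to object
        kept = ["x", kept[-1], kept[0]]
    return kept, flags

print(eliminate_verb_phrase("I HAVE BEEN GOING".split()))          # (['I', 'GO'], ...)
print(eliminate_verb_phrase("WAS I TAKEN".split(), question=True)) # (['x', 'TAKE', 'I'], ...)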
How will I gobble up a "noun phrase" in the input?
Trigger on an article or adjective or noun and keep going
until the next verb or preposition.
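A rough sketch of that gobbler, in Python for illustration (the word-class
table is a made-up fragment):

# Hypothetical noun-phrase gobbler: trigger on an article, adjective, or noun
# and keep collecting until the next verb or preposition.
CLASS = {"THE": "ART", "A": "ART", "OTHER": "ADJ", "SICK": "ADJ",
         "PATIENTS": "NOUN", "PERSON": "NOUN", "NERVES": "NOUN",
         "GET": "VERB", "BOTHER": "VERB", "ON": "PREP", "WITH": "PREP"}

def gobble_noun_phrase(words, start):
    if CLASS.get(words[start]) not in ("ART", "ADJ", "NOUN"):
        return None, start
    end = start
    while end < len(words) and CLASS.get(words[end]) not in ("VERB", "PREP"):
        end += 1
    return words[start:end], end

print(gobble_noun_phrase("THE OTHER PATIENTS BOTHER ME".split(), 0))
# (['THE', 'OTHER', 'PATIENTS'], 3)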
***** OTHER **********
Represent both "start of sentence" and "end of sentence" in patterns.
What about "not"? Drop it and set a flag. (This always worked before)
Learn about unknown words from context.
Patterns specify restrictions on part of speech.
Picking up Doctor's name is an instance of this?
This is related to predictive look-up. This would also be good for disambiguation.
(Knowing a word should be taken as a noun because of prior syntax.)
Unfortunately, it clashes with fixed levels of processing. Must
do complete processing on initial segments to predict meaning of
following words.
Ellipsis problems. Try partial match with previous sentence(s) to discover
missing parts.
What about "dialogue patterns" like question→answer, HELLO→HELLO.
KEN: CAN PARRY PATTERN MATCH HIS OWN OUTPUT?
Only if the Dr might also say it.
HOW MANY AMBIGUITIES IS HE STILL MISSING?
When they arise, patterns are included to distinguish important meanings.
Write prototype program. Make sure it can do everything FRONT does?!
BILL: Don't need to do everything FRONT does. Only the common easy ones.
PRESENTATION FOR RUTGERS:
DEFINITION OF ENTIRE PROBLEM:
"The Natural Language Problem"
To accept ANY ENGLISH input and recognize it with SUFFICIENT
RESOLUTION to allow the memory part of the program to perform
reasonably. I use "sufficient resolution" to imply that some
statements which are superficially very similar must be distinguished
as representing distinct ideas to the program, while other groups of
statements which are superficially diverse can be treated as
equivalent ideas.
(example: HOW OLD IS YOUR MOTHER? is distinct from HOW OLD IS YOUR FATHER?
but HOW OLD IS FRED SCHWARTZ? and WHAT IS JOE BLOW'S AGE? are equivalent.)
DEFINITION OF CHOSEN SUB-PROBLEM:
"Idiom problem"
Everyday English is full of phrases which are not recognizable or
reconstructable from their parts. The structure of these phrases is
much more rigid than "normal" English.
These phrases have:
        no slots                              OUT ON A LIMB
        constrained slots (pronoun only)      some example from SYNS
        loose slots (noun phrase)             {get} ON {noun}'S NERVES
        disconnected parts (verb-particle)    {pick} {noun phrase} UP
GIVEN: PRACTICAL SIMPLIFICATIONS
Not really ANY ENGLISH;
Slot filling noun phrases must be short. (i.e. prepositional phrases
and relative clauses are rare.)
but Verb-particle might be separated by NP with relative clause?
SUFFICIENT RESOLUTION is not as fine as you might imagine;
Human performance of this task provides some justification for
only finding what you want to (expect to) find in a sentence.
People sometimes seem to get the "drift" of a statement without
thinking about exactly what it means and what it implies.
ISSUE: RESOLUTION
Should you do a complete analysis and use rules to select the present
meaning (interpret in the present context), or just seek what you want and
fail badly (not even see the syntax) if words are not used as expected?
Understand all or settle for less, like "insult" or "topic=family"?
ISSUE: Separation of "recognition" and "memory"
Does the recognizer do semantic processing like "tea in china → game",
or just syntax analysis, leaving meaning up to Memory?
PREVIOUS "SOLUTIONS":
1) Context-free grammars
2) A.F.S.T.N. (Woods)
3) conceptual parsing (Riesbeck)
4) pattern matching (PARRY)
5) keyword search (old PARRY, MYCIN)
Bearing in mind that almost any program can be written in almost any formalism:
1) Is too rigid.
2) Also insists on examining all syntax. Might be modified to look for
idioms but it would produce unusual looking AFSTN's.
3) Is too literal-minded as it has been done so far. Again, it could be altered.
4) Misses the regularity of English syntax which does exist.
5) Totally ignores extra words (and clauses) and word order (syntax).
***** DATA STORAGE FORMAT *********
All jumbled together in one huge table? Wastes space (or search time).
Fixed (maximum) size entries (and values) use too much room. This can be
avoided by separating tables by size. Then getting the longest match
requires searching tables for all sizes (starting at the biggest) until found.
Would a compromise get the best of both or the worst of both? For example,
just a few (= 2 or 3) tables for different sizes of data.
Arbitrary size entries (and values) requires finding special marker characters
which separate stuff. This makes binary search harder and lots slower.
TWO IDEAS FOR STORING THE FRONT END'S TABLES:
***** 1) SIMPLE AND LOTS OF DISK READS **********
The tables are stored on disk. Assuming each entry is composed of 36-bit words
of 5 ascii chars each, store each entry packed in right next to the rest.
The first word of each is either a number telling how many words the entry is,
or a special word indicating a break between entries.
In core is a table listing:
1) the first entry on a disk record,
2) the address on the disk of the disk record.
The lookup consists of a binary search through the table for the proper
disk record, reading in the disk record in dump mode, and searching
the disk record linearly for the proper entry.
If the table in core is too long, it can be cut in half by reading in 2 records
from disk and keeping index entries for only every other record in the table.
And the table can be halved again...
This method takes one disk read per entry lookup, assuming the entire
index table is in core. This may be a problem, because each entry
in the table is of arbitrary length, although realistically it may
be possible to distinguish between two entries by their first five words.
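A rough sketch of method 1's lookup, in Python for illustration (the records
are simulated in memory rather than read from disk, and the entries are
invented):

# Hypothetical in-core index for the disk-resident table described above.
# Each index entry is the first key on a record; records are simulated here
# as sorted lists of (key, value) pairs.
import bisect

RECORDS = [
    [("AFRAID", 101), ("ANGRY", 102), ("BOTHER", 103)],
    [("DOCTOR", 104), ("HORSE", 105), ("MAFIA", 106)],
    [("NURSE", 107), ("RACKET", 108), ("SHYSTER", 109)],
]
INDEX = [record[0][0] for record in RECORDS]   # first entry on each disk record

def lookup(key):
    # binary search the index for the record that could hold the key...
    i = bisect.bisect_right(INDEX, key) - 1
    if i < 0:
        return None
    # ...then "read" that one record and scan it linearly
    for k, v in RECORDS[i]:
        if k == key:
            return v
    return None

print(lookup("MAFIA"))    # 106
print(lookup("ZEBRA"))    # None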
***** 2) CLEVER AND SPACE-SAVING AND PROGRAMMING CHALLENGE **********
This looks more like a transition net implementation.
Everything is in core.
The first word of each entry is stored in an array. The array entries consist
of this first word of ascii, and a pointer to the array of second ascii words
which can follow this particular first one. etc. The final pointer is flagged
as a pointer to the λ or whatever.
The lookup consists of binary searching the top-level array on the first ascii
word to be matched. The pointer to the next-level table is used, and again
a binary search. etc.
This method is the most compact that REM and I could think of. The access
might be faster than the previous one. One problem is programming. The
other one is setting up the data structure. It is possible, but a fascinating
task in itself.
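A rough sketch of method 2's nested arrays, in Python for illustration (the
entries and λ numbers are invented):

# Hypothetical in-core word-by-word transition table: at each level a sorted
# array of (word, next-level-or-lambda) pairs, searched by bisection.
import bisect

# Entries for "GET ON *NP* NERVES -> 4102" and "GO AHEAD -> 4103".
TABLE = [
    ("GET", [("ON", [("*NP*", [("NERVES", 4102)])])]),
    ("GO",  [("AHEAD", 4103)]),
]

def match(words, level=TABLE):
    if not words:
        return None
    keys = [k for k, _ in level]
    i = bisect.bisect_left(keys, words[0])
    if i == len(keys) or keys[i] != words[0]:
        return None
    nxt = level[i][1]
    if isinstance(nxt, int):          # final pointer flagged as the lambda
        return nxt
    return match(words[1:], nxt)

print(match(["GET", "ON", "*NP*", "NERVES"]))   # 4102
print(match(["GO", "AHEAD"]))                   # 4103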
Ideas on a new recognizer
We could have two modules: a recognizer and a memory-inferencer.
The recognizer would have linguistic information, such as all closed class
words and idioms (a la SYNS). The memory would have the possible semantic
relationships that could occur, such as (DOCTORS BOTHER) and
(SICK PERSON) for all the open class words, as well as all of the synonyms,
except that idiomatic synonyms would be in the recognizer (SYNS).
Example:
Input - Do the hospital psychiatrists get on the other patients' nerves?
Information in the recognizer:
GET ON *NP* NERVES →→ BOTHER *NP*
DO *NP* *VP* →→ (YESNO (*NP* *VP*))
Information in the memory (which the recognizer will have to ask for):
PSYCHIATRIST →→ DOCTOR
(HOSPITAL DOCTOR) →→ (DOCTOR (ATTRIBUTE HOSPITAL))
(DOCTOR BOTHER *NP*) →→ (BOTHER DOCTOR *NP*) traditional predicate form
The output of this process might be:
(YESNO (BOTHER (DOCTOR (ATTRIBUTE HOSPITAL)) PATIENTS))
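A rough sketch of one recognizer rewrite from the example above, in Python for
illustration (here *NP* matches a single word; a real recognizer would call a
noun-phrase gobbler, and the memory's synonym step PSYCHIATRISTS → DOCTOR is
assumed to have run already):

# Hypothetical single rewrite step.
def apply_rule(words, pattern, replacement):
    for start in range(len(words) - len(pattern) + 1):
        bindings = {}
        for p, w in zip(pattern, words[start:start + len(pattern)]):
            if p == "*NP*":
                bindings[p] = w          # bind the slot
            elif p != w:
                break
        else:
            out = [bindings.get(r, r) for r in replacement]
            return words[:start] + out + words[start + len(pattern):]
    return words

sentence = "DO DOCTOR GET ON PATIENTS NERVES".split()
sentence = apply_rule(sentence, ["GET", "ON", "*NP*", "NERVES"], ["BOTHER", "*NP*"])
print(sentence)   # ['DO', 'DOCTOR', 'BOTHER', 'PATIENTS']
# a second rule, DO *NP* *VP*, would then wrap this into (YESNO ...)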
* * * * * * * * * *
The recognizer would have the following information:
a) everything that is in SYNS, e.g. patterns of verbs and particles,
idiomatic word patterns.
b) the necessary patterns or grammar to remove the interrogatives from the
start of the sentence, and inflections from the inflected verb. E.g., take
the WHY, HOW, and HOW LONG off and there is usually a YESNO question left.
Similarly, WHO, WHAT, and WHERE can be lumped together. There should be
similar consolidation for relative clauses.
c) the markings for closed class words, e.g. as follows:
prepositions and particles
modal auxiliaries
conjunctions
interrogatives
relative pronouns
pronouns
determiners
All of these words have unambiguous markings.
d) (possibly) the markings for the irregular verbs, since they tend not to be
used as adjectives or nouns.
e) (possibly) adverbs - sweepable?
Note that all the other words are either nouns, verbs, or adjectives (all the
open class words). We may be able to distinguish these by the information
in the memory, which is as follows.
The memory would have this information:
a) synonyms for the open class words
b) a set of potential-attribute patterns, and a set of potential-action patterns.
Potential attributes are what can modify what as an attribute. E.g.
(hospital doctor), (your doctor), (sick person). Potential actions are
what can do what. E.g. (mafia scare), (mafia bug).
[What about direct objects?]
c) Inferences that will take the recognizer's output and transform it into
something that the response module can use. The recognizer should be able
to recognize (BOTHER X Y) for any X and Y that are human, although the
response module may not want to deal with them all. At the start this can
be a fairly simple mapping. Later hopefully this can be expanded to do
general inferences on anything that can be recognized.
* * * * * * * * * *
Characteristics of the current pattern matcher that we want to keep:
1) Extra words are thrown out easily, without extensive search.
Question: what is the origin of the extra words --
a) spelling errors WHWHAT WHAT IS YOUR NAME
b) spurious modifiers THE OTHER PATIENTS, THE STAFF NURSES
c) wordy constructions A SICK PERSON, HOW DO YOU REALLY FEEL JUST RIGHT NOW
2) Extra patterns are thrown out easily, and the recognizer gets meaning
from part of the input, even if another part is totally lost.
There seem to be 2 types of search going on right now, and they may or may not
be combined in the new recognizer:
1) SPAT - we have a definite pattern, and we look it up in the tables to
either find the exact or the closest match.
2) DROPONE - we have a sequence, and we take various subsequences and look
at them.
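A rough sketch of the two searches, in Python for illustration (the pattern
table and λ numbers are invented):

# Hypothetical versions of the two searches: SPAT looks a definite pattern up
# in the table; DROPONE drops words until a stored pattern is hit.
from itertools import combinations

PATTERNS = {("HOW", "BE", "YOU"): 2001, ("BE", "YOU", "AFRAID"): 2002}

def spat(words):
    return PATTERNS.get(tuple(words))

def dropone(words):
    # try the full sequence first, then every subsequence one word shorter, etc.
    for size in range(len(words), 0, -1):
        for keep in combinations(range(len(words)), size):
            hit = PATTERNS.get(tuple(words[i] for i in keep))
            if hit is not None:
                return hit
    return None

print(spat(["HOW", "BE", "YOU"]))               # 2001
print(dropone(["HOW", "BE", "YOU", "REALLY"]))  # 2001 (drops REALLY)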
The final output of the recognizer is a problem. It would be nice to have
a general output, so that an inferencer could then use the input directly
to make inferences and extract the info it was looking for. But we don't
want the output too messy, or else the inferencer has to work just as
hard to interpret it as the recognizer did originally. The general output
(e.g., (YESNO (THINK YOU (TRY I (HURT I YOU)))) for "do you think I am trying
to hurt you") could be useful if we can develop some rules for throwing out,
most of the time, all the extra info that we usually don't need, and only
allowing all the info to be analyzed in the special cases when the inferencer
needs it.
The interaction between a pattern-matching mode and a parsing mode may
be the hard problem. It may be that they call each other recursively,
the pattern matcher having patterns like the following that are tried
in order:
(GO JUMP IN THE LAKE)
(GO AHEAD)
(DO *NP* *VP*)
(*NP* *VP*)
I.e., the pattern matcher would try every specific thing it could, calling
the parser to fill in the indefinite items, and then call the parser if
the sentence seemed to be a classic with no patterned phrases at the start.
But then the parser must call the pattern matcher at each step to check for
special patterns inside the set of input which is being parsed.
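A rough sketch of the pattern-matcher half of that interaction, in Python for
illustration (the patterns, word lists, and the one-word "parser" are toy
stand-ins; a real parser would also call back into the matcher as described
above):

# Hypothetical matcher trying its patterns in order, calling a tiny "parser"
# to fill the indefinite items.
PATTERNS = [
    (["GO", "JUMP", "IN", "THE", "LAKE"], "insult"),
    (["GO", "AHEAD"], "permission"),
    (["DO", "*NP*", "*VP*"], "yesno-question"),
    (["*NP*", "*VP*"], "statement"),
]
NOUNS, VERBS = {"YOU", "I", "DOCTORS"}, {"LIE", "BOTHER", "GO"}

def parse_slot(slot, word):
    # stand-in parser: accept one word per slot; a real one would gobble phrases
    return (slot == "*NP*" and word in NOUNS) or (slot == "*VP*" and word in VERBS)

def match(words):
    for pattern, meaning in PATTERNS:        # tried in order, most specific first
        if len(pattern) == len(words) and all(
            p == w or (p.startswith("*") and parse_slot(p, w))
            for p, w in zip(pattern, words)
        ):
            return meaning
    return None

print(match("GO AHEAD".split()))         # permission
print(match("DO DOCTORS LIE".split()))   # yesno-question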